Background and Motivation

As graduate students, debt is all around us. Fortunately, we are also data scientists, and data is all around us, too. Most of the major milestones in life - such as graduating from college or a graduate program, or buying a house - require taking on debt. A report from Time’s Money Magazine based on the Federal Reserve’s Survey of Consumer Finances found that on average, Americans under 35 owe $67,400. For middle-aged Americans, the average is even higher, ranging from $108,300 for 55-64 year olds to $134,600 for 45-54 year olds (http://time.com/money/5233033/average-debt-every-age/).

We are interested in digging deeper into how loan applications are considered, and better understanding factors which might be considered by lending agents when applying for a loan, from the applicant’s perspective. Alternatively, our analysis could be useful from the lender’s perspective in identifying factors which predict loan defaults. The goal for our final project was to build a prediction model using the LendingClub* loans data set. Specifically, we wanted to build a model that could predict a dichotomized outcome of loan status (which will be explained in more detail below) using the variables given in the data set.

On this page, we will walk you through our preliminary data exploration. On the methods page, we will walk you through our process for building our final prediction model.

*Lending Club is a United States based peer-to-peer lending platform. It is the world’s largest of such companies, facilitating loan exchange between borrowers and investors. Lending Club enables borrows to create unsecured personal loans between $1,000 and $40,000. Investors can search a database of loan listings on their website and select any number of loans to invest in.

Data Exploration

rm(list = ls())
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
library(tidyverse)
library(funModeling)
library(caret)
library(VIM)
library(mice)
library(ggcorrplot)
library(plotly)
library(pROC)
library(lubridate)
library(glmnet)
library(broom)
library(MASS)
library(usmap)
library(RColorBrewer)
set.seed(1)
set_dir <- '/Users/shuheimiyasaka/Google Drive/BST 260 Final Project/For Submission'

Our data is available via kaggle at https://www.kaggle.com/wendykan/lending-club-loan-data. This data is unique from most other financial institution’s data because of the lending method used by ‘LendingClub’. Headquartered in San Francisco, LendingClub connects borrowers applying for personal loans, auto refinancing, business loans, and elective medical procedures with investors. LendingClub reports that it is America’s largest online marketplace, and emphasizes the digital aspects of its model (https://www.lendingclub.com/). LendingClub also services the loans, and therefore maintains data on the loans’ statuses, as well as information about the loan application.

Data Cleaning

setwd(set_dir)
load('./loan.Rdata')
#loan.dat <- read.csv('loan.csv', header = TRUE)
#save(loan.dat, file = "./loan.RData")

The data set has 887,379 records with 74 variables.

dim(loan.dat)
## [1] 887379     74
names(loan.dat)
##  [1] "id"                          "member_id"                  
##  [3] "loan_amnt"                   "funded_amnt"                
##  [5] "funded_amnt_inv"             "term"                       
##  [7] "int_rate"                    "installment"                
##  [9] "grade"                       "sub_grade"                  
## [11] "emp_title"                   "emp_length"                 
## [13] "home_ownership"              "annual_inc"                 
## [15] "verification_status"         "issue_d"                    
## [17] "loan_status"                 "pymnt_plan"                 
## [19] "url"                         "desc"                       
## [21] "purpose"                     "title"                      
## [23] "zip_code"                    "addr_state"                 
## [25] "dti"                         "delinq_2yrs"                
## [27] "earliest_cr_line"            "inq_last_6mths"             
## [29] "mths_since_last_delinq"      "mths_since_last_record"     
## [31] "open_acc"                    "pub_rec"                    
## [33] "revol_bal"                   "revol_util"                 
## [35] "total_acc"                   "initial_list_status"        
## [37] "out_prncp"                   "out_prncp_inv"              
## [39] "total_pymnt"                 "total_pymnt_inv"            
## [41] "total_rec_prncp"             "total_rec_int"              
## [43] "total_rec_late_fee"          "recoveries"                 
## [45] "collection_recovery_fee"     "last_pymnt_d"               
## [47] "last_pymnt_amnt"             "next_pymnt_d"               
## [49] "last_credit_pull_d"          "collections_12_mths_ex_med" 
## [51] "mths_since_last_major_derog" "policy_code"                
## [53] "application_type"            "annual_inc_joint"           
## [55] "dti_joint"                   "verification_status_joint"  
## [57] "acc_now_delinq"              "tot_coll_amt"               
## [59] "tot_cur_bal"                 "open_acc_6m"                
## [61] "open_il_6m"                  "open_il_12m"                
## [63] "open_il_24m"                 "mths_since_rcnt_il"         
## [65] "total_bal_il"                "il_util"                    
## [67] "open_rv_12m"                 "open_rv_24m"                
## [69] "max_bal_bc"                  "all_util"                   
## [71] "total_rev_hi_lim"            "inq_fi"                     
## [73] "total_cu_tl"                 "inq_last_12m"
meta_loans <- funModeling::df_status(loan.dat, print_results = FALSE)
meta_loans[order(-meta_loans$p_na),]

As part of data exploration, we examined with percentage of “zeros”, missing records, and unique values in the data set per variable as shown above. From the table above, we notice a number of variables with significant amount of missing data.

Based on examining the data set and reading the data dictionary, we decided to immediately rule out the following variables from our model: id, member_id, url, and desc.

cols.2.remove <- c('id', 'member_id', 'url', 'desc')

We decided to exclude variables with more than 10% missing data (19 variables).

missing.data.col <- meta_loans$variable[meta_loans$p_na > 10.]
missing.data.col
##  [1] "mths_since_last_delinq"      "mths_since_last_record"     
##  [3] "mths_since_last_major_derog" "annual_inc_joint"           
##  [5] "dti_joint"                   "open_acc_6m"                
##  [7] "open_il_6m"                  "open_il_12m"                
##  [9] "open_il_24m"                 "mths_since_rcnt_il"         
## [11] "total_bal_il"                "il_util"                    
## [13] "open_rv_12m"                 "open_rv_24m"                
## [15] "max_bal_bc"                  "all_util"                   
## [17] "inq_fi"                      "total_cu_tl"                
## [19] "inq_last_12m"
length(missing.data.col)
## [1] 19
cols.2.remove <- c(cols.2.remove, missing.data.col)
meta_loans[order(meta_loans$unique),]
cols.2.remove <- c(cols.2.remove, 'policy_code')

We also decided to remove policy_code since it only has one unique value.

At this point, we had 50 potential covariates:

cols.2.keep <- !(colnames(loan.dat) %in% cols.2.remove)
colnames(loan.dat)[cols.2.keep]
##  [1] "loan_amnt"                  "funded_amnt"               
##  [3] "funded_amnt_inv"            "term"                      
##  [5] "int_rate"                   "installment"               
##  [7] "grade"                      "sub_grade"                 
##  [9] "emp_title"                  "emp_length"                
## [11] "home_ownership"             "annual_inc"                
## [13] "verification_status"        "issue_d"                   
## [15] "loan_status"                "pymnt_plan"                
## [17] "purpose"                    "title"                     
## [19] "zip_code"                   "addr_state"                
## [21] "dti"                        "delinq_2yrs"               
## [23] "earliest_cr_line"           "inq_last_6mths"            
## [25] "open_acc"                   "pub_rec"                   
## [27] "revol_bal"                  "revol_util"                
## [29] "total_acc"                  "initial_list_status"       
## [31] "out_prncp"                  "out_prncp_inv"             
## [33] "total_pymnt"                "total_pymnt_inv"           
## [35] "total_rec_prncp"            "total_rec_int"             
## [37] "total_rec_late_fee"         "recoveries"                
## [39] "collection_recovery_fee"    "last_pymnt_d"              
## [41] "last_pymnt_amnt"            "next_pymnt_d"              
## [43] "last_credit_pull_d"         "collections_12_mths_ex_med"
## [45] "application_type"           "verification_status_joint" 
## [47] "acc_now_delinq"             "tot_coll_amt"              
## [49] "tot_cur_bal"                "total_rev_hi_lim"
length(colnames(loan.dat)[cols.2.keep])
## [1] 50
loan.dat <- loan.dat[, cols.2.keep]

We also decided to remove 6 records with missing or zero annual income since we felt this information was a requirement for obtaining a loan and a covariate that we must definitely include in our final model (and didn’t feel we could impute these values properly)!

query = loan.dat$annual_inc == 0.
query.na = is.na(query)
if (sum(query.na) > 0){
  query[query.na] = TRUE
}
if (sum(query) > 0){
  loan.dat = loan.dat[!query,]
} else stop('unexpected case')

With the remaining set of records and covariates, we decided to examine the pairwise correlation of covariates:

meta_loans <- funModeling::df_status(loan.dat, print_results = FALSE)
numeric_cols <- meta_loans$variable[meta_loans$type == 'numeric']

cor.dat <- cor(loan.dat[,numeric_cols], loan.dat[,numeric_cols])
plot_ly(x=colnames(cor.dat), 
        y=rownames(cor.dat), 
        z = cor.dat, type = "heatmap", colorscale="Greys")
#ggcorrplot(cor(loan.dat[,numeric_cols]))
#aggr(loan.dat, combined=T, cex.axis=0.6)

We notice from the plot above that there are a few covariates that are highly correlated (which is not unexpected).

We also calculated basic summary statistics of our covariates to help us better understand the data:

summary(loan.dat)
##    loan_amnt      funded_amnt    funded_amnt_inv         term       
##  Min.   :  500   Min.   :  500   Min.   :    0    36 months:621119  
##  1st Qu.: 8000   1st Qu.: 8000   1st Qu.: 8000    60 months:266254  
##  Median :13000   Median :13000   Median :13000                      
##  Mean   :14755   Mean   :14742   Mean   :14703                      
##  3rd Qu.:20000   3rd Qu.:20000   3rd Qu.:20000                      
##  Max.   :35000   Max.   :35000   Max.   :35000                      
##                                                                     
##     int_rate      installment      grade        sub_grade     
##  Min.   : 5.32   Min.   :  15.67   A:148198   B3     : 56323  
##  1st Qu.: 9.99   1st Qu.: 260.71   B:254535   B4     : 55626  
##  Median :12.99   Median : 382.55   C:245859   C1     : 53387  
##  Mean   :13.25   Mean   : 436.72   D:139541   C2     : 52235  
##  3rd Qu.:16.20   3rd Qu.: 572.60   E: 70705   C3     : 50161  
##  Max.   :28.99   Max.   :1445.46   F: 23046   C4     : 48857  
##                                    G:  5489   (Other):570784  
##             emp_title          emp_length      home_ownership  
##                  : 51451   10+ years:291569   ANY     :     3  
##  Teacher         : 13469   2 years  : 78870   MORTGAGE:443555  
##  Manager         : 11240   < 1 year : 70601   NONE    :    46  
##  Registered Nurse:  5525   3 years  : 70026   OTHER   :   182  
##  Owner           :  5376   1 year   : 57095   OWN     : 87470  
##  RN              :  5355   5 years  : 55704   RENT    :356117  
##  (Other)         :794957   (Other)  :263508                    
##    annual_inc           verification_status     issue_d      
##  Min.   :   1200   Not Verified   :266744   Oct-2015: 48631  
##  1st Qu.:  45000   Source Verified:329558   Jul-2015: 45962  
##  Median :  65000   Verified       :291071   Dec-2015: 44341  
##  Mean   :  75028                            Oct-2014: 38782  
##  3rd Qu.:  90000                            Nov-2015: 37529  
##  Max.   :9500000                            Aug-2015: 35886  
##                                             (Other) :636242  
##              loan_status     pymnt_plan               purpose      
##  Current           :601777   n:887363   debt_consolidation:524214  
##  Fully Paid        :207723   y:    10   credit_card       :206181  
##  Charged Off       : 45248              home_improvement  : 51829  
##  Late (31-120 days): 11591              other             : 42890  
##  Issued            :  8460              major_purchase    : 17277  
##  In Grace Period   :  6253              small_business    : 10377  
##  (Other)           :  6321              (Other)           : 34605  
##                      title           zip_code        addr_state    
##  Debt consolidation     :414000   945xx  :  9770   CA     :129517  
##  Credit card refinancing:164330   750xx  :  9417   NY     : 74082  
##  Home improvement       : 40112   112xx  :  9272   TX     : 71136  
##  Other                  : 31892   606xx  :  8641   FL     : 60935  
##  Debt Consolidation     : 15760   300xx  :  8126   IL     : 35476  
##  Major purchase         : 12051   100xx  :  7605   NJ     : 33256  
##  (Other)                :209228   (Other):834542   (Other):482971  
##       dti           delinq_2yrs      earliest_cr_line  inq_last_6mths   
##  Min.   :   0.00   Min.   : 0.0000   Aug-2001:  6659   Min.   : 0.0000  
##  1st Qu.:  11.91   1st Qu.: 0.0000   Aug-2000:  6529   1st Qu.: 0.0000  
##  Median :  17.65   Median : 0.0000   Oct-2000:  6322   Median : 0.0000  
##  Mean   :  18.13   Mean   : 0.3144   Oct-2001:  6154   Mean   : 0.6946  
##  3rd Qu.:  23.95   3rd Qu.: 0.0000   Aug-2002:  6086   3rd Qu.: 1.0000  
##  Max.   :1092.52   Max.   :39.0000   Sep-2000:  5918   Max.   :33.0000  
##                    NA's   :25        (Other) :849705   NA's   :25       
##     open_acc        pub_rec          revol_bal         revol_util    
##  Min.   : 0.00   Min.   : 0.0000   Min.   :      0   Min.   :  0.00  
##  1st Qu.: 8.00   1st Qu.: 0.0000   1st Qu.:   6443   1st Qu.: 37.70  
##  Median :11.00   Median : 0.0000   Median :  11875   Median : 56.00  
##  Mean   :11.55   Mean   : 0.1953   Mean   :  16921   Mean   : 55.07  
##  3rd Qu.:14.00   3rd Qu.: 0.0000   3rd Qu.:  20829   3rd Qu.: 73.60  
##  Max.   :90.00   Max.   :86.0000   Max.   :2904836   Max.   :892.30  
##  NA's   :25      NA's   :25                          NA's   :498     
##    total_acc      initial_list_status   out_prncp     out_prncp_inv  
##  Min.   :  1.00   f:456843            Min.   :    0   Min.   :    0  
##  1st Qu.: 17.00   w:430530            1st Qu.:    0   1st Qu.:    0  
##  Median : 24.00                       Median : 6459   Median : 6456  
##  Mean   : 25.27                       Mean   : 8403   Mean   : 8400  
##  3rd Qu.: 32.00                       3rd Qu.:13659   3rd Qu.:13654  
##  Max.   :169.00                       Max.   :49373   Max.   :49373  
##  NA's   :25                                                          
##   total_pymnt    total_pymnt_inv total_rec_prncp total_rec_int    
##  Min.   :    0   Min.   :    0   Min.   :    0   Min.   :    0.0  
##  1st Qu.: 1915   1st Qu.: 1900   1st Qu.: 1201   1st Qu.:  441.5  
##  Median : 4895   Median : 4862   Median : 3215   Median : 1073.3  
##  Mean   : 7559   Mean   : 7521   Mean   : 5758   Mean   : 1754.8  
##  3rd Qu.:10617   3rd Qu.:10566   3rd Qu.: 8000   3rd Qu.: 2238.3  
##  Max.   :57778   Max.   :57778   Max.   :35000   Max.   :24205.6  
##                                                                   
##  total_rec_late_fee   recoveries       collection_recovery_fee
##  Min.   :  0.0000   Min.   :    0.00   Min.   :   0.000       
##  1st Qu.:  0.0000   1st Qu.:    0.00   1st Qu.:   0.000       
##  Median :  0.0000   Median :    0.00   Median :   0.000       
##  Mean   :  0.3967   Mean   :   45.92   Mean   :   4.881       
##  3rd Qu.:  0.0000   3rd Qu.:    0.00   3rd Qu.:   0.000       
##  Max.   :358.6800   Max.   :33520.27   Max.   :7002.190       
##                                                               
##    last_pymnt_d    last_pymnt_amnt     next_pymnt_d    last_credit_pull_d
##  Jan-2016:470148   Min.   :    0.0   Feb-2016:553404   Jan-2016:730572   
##  Dec-2015:150861   1st Qu.:  280.2           :252971   Dec-2015: 19308   
##          : 17659   Median :  462.8   Jan-2016: 78195   Nov-2015: 11490   
##  Oct-2015: 16000   Mean   : 2164.2   Mar-2011:   107   Oct-2015: 10419   
##  Jul-2015: 14483   3rd Qu.:  831.2   Apr-2011:   101   Sep-2015: 10087   
##  Nov-2015: 13981   Max.   :36475.6   Feb-2011:    91   Jul-2015:  8642   
##  (Other) :204241                     (Other) :  2504   (Other) : 96855   
##  collections_12_mths_ex_med   application_type 
##  Min.   : 0.00000           INDIVIDUAL:886864  
##  1st Qu.: 0.00000           JOINT     :   509  
##  Median : 0.00000                              
##  Mean   : 0.01438                              
##  3rd Qu.: 0.00000                              
##  Max.   :20.00000                              
##  NA's   :141                                   
##    verification_status_joint acc_now_delinq       tot_coll_amt    
##                 :886864      Min.   : 0.000000   Min.   :      0  
##  Not Verified   :   281      1st Qu.: 0.000000   1st Qu.:      0  
##  Source Verified:    61      Median : 0.000000   Median :      0  
##  Verified       :   167      Mean   : 0.004991   Mean   :    226  
##                              3rd Qu.: 0.000000   3rd Qu.:      0  
##                              Max.   :14.000000   Max.   :9152545  
##                              NA's   :25          NA's   :70272    
##   tot_cur_bal      total_rev_hi_lim 
##  Min.   :      0   Min.   :      0  
##  1st Qu.:  29853   1st Qu.:  13900  
##  Median :  80559   Median :  23700  
##  Mean   : 139458   Mean   :  32069  
##  3rd Qu.: 208205   3rd Qu.:  39800  
##  Max.   :8000078   Max.   :9999999  
##  NA's   :70272     NA's   :70272